@mvalsecchi-nv mvalsecchi-nv commented Nov 10, 2025

Close #453

WIP as I have not tested it yet.

I used this RH article, as it is the source of truth for the RHOCP <> RHCOS matrix.

As per the supported matrix, GPU Operator only supports RHOCP 4.14 or later.

I've added extra cases (4.12, 4.13), since some users might be interested in creating drivers for environments that NVIDIA does not support but that can still receive updates from Red Hat through the Extended Update Support Add-On. If that is not necessary, I can remove the lines covering 4.12 and 4.13.

Since the GPU Operator docs instruct users to export OS_TAG as rhcos4.<x>, I believe the Makefile changes should not break any existing automation scripts.

Let me find a lab to test it out, and update accordingly
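
For readers following along, the OS_TAG handling described above could be sketched in shell roughly as follows. This is an illustration only, not the Makefile change in this PR: the function name `rhel_dir_for_os_tag` is hypothetical, and the 4.12 to RHEL 8 / 4.13+ to RHEL 9 split is an assumption that should be checked against the Red Hat article.

```shell
# Hypothetical helper (not part of this PR): map an OS_TAG such as
# "rhcos4.18" to the vgpu-manager subdirectory to build from.
# ASSUMPTION: RHCOS 4.12 is RHEL 8 based, RHCOS 4.13 and later are RHEL 9 based.
rhel_dir_for_os_tag() {
  os_tag="$1"

  # Only tags of the form rhcos4.<x> are handled here.
  case "$os_tag" in
    rhcos4.*) minor="${os_tag#rhcos4.}" ;;
    *) echo "unsupported OS_TAG: $os_tag" >&2; return 1 ;;
  esac

  # Reject an empty or non-numeric minor version.
  case "$minor" in
    ''|*[!0-9]*) echo "unsupported OS_TAG: $os_tag" >&2; return 1 ;;
  esac

  if [ "$minor" -ge 13 ]; then
    echo "rhel9"
  else
    echo "rhel8"
  fi
}
```

Something like `rhel_dir_for_os_tag "$OS_TAG"` would then yield the directory name, while tags that do not follow the rhcos4.<x> form fail with a non-zero exit status.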

copy-pr-bot bot commented Nov 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@@ -0,0 +1,34 @@
FROM nvcr.io/nvidia/cuda:13.0.1-base-ubi9
@mvalsecchi-nv mvalsecchi-nv Nov 12, 2025

Unfortunately, symlinks (from vgpu-manager/rhel8 to vgpu-manager/rhel9) would not cut it: we pass the subdir, which makes the files inside rhel8 unreachable from any sibling folder.

Let me see if I can come up with a cleaner way, rather than duplicating all the files in vgpu-manager/rhel*

It seems most directories do indeed have copies of the same files, so I'll leave the duplicates inside the vgpu-manager/rhel* folders instead of refactoring.

Signed-off-by: Michele Valsecchi <[email protected]>
Signed-off-by: Michele Valsecchi <[email protected]>
@mvalsecchi-nv
I tested commit 9a950fc155929a8235be7cab7022b7cd7882fa7c with OCP 4.18.24, GPU Operator 25.10.0, and driver 580.95.02, and it works as expected.

Environment:

 $ oc get csv -n nvidia-gpu-operator
NAME                              DISPLAY               VERSION   REPLACES                         PHASE
gpu-operator-certified.v25.10.0   NVIDIA GPU Operator   25.10.0   gpu-operator-certified.v25.3.4   Succeeded

 $  oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.24   True        False         2d20h 

 $  ls -lah vgpu-manager/rhel9
[...]
-rwxr-xr-x 1 user user  97M Nov 11 16:54 580.95.02-vgpu-kvm.run

Outcome:

 $ oc get pod -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-55f5686c46-762nh                               1/1     Running   0          17h
nvidia-sandbox-device-plugin-daemonset-v6q7q                1/1     Running   0          16h
nvidia-sandbox-validator-w7pfn                              1/1     Running   0          16h
nvidia-vgpu-device-manager-22zpw                            1/1     Running   0          16h
nvidia-vgpu-manager-daemonset-418.94.202509100653-0-zmkcl   2/2     Running   0          16h

 $ oc get clusterpolicy -o yaml | less 
  vgpuDeviceManager:
    enabled: true
  vgpuManager:
    enabled: true
    image: vgpu-manager
    imagePullSecrets:
    - podman-registry-credentials
    - private-registry-secret
    repository: <redacted>.svc:5000/openshift
    version: 580.95.02
status:
  conditions:
  - lastTransitionTime: "2025-11-12T13:24:05Z"
    message: ClusterPolicy is ready as all resources have been successfully reconciled <===
    reason: Reconciled 
    status: "True"
    type: Ready <===

Let me test the refactored version and also create a VM to confirm everything works as intended; then I'll remove the WIP from the title.
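
The readiness check read off the ClusterPolicy YAML above could also be scripted instead of paging through `less`. A minimal sketch, assuming a single ClusterPolicy exists in the cluster; the function name `clusterpolicy_ready` is hypothetical:

```shell
# Hypothetical helper: succeed only if the ClusterPolicy reports Ready=True.
# ASSUMPTION: exactly one ClusterPolicy exists, so the jsonpath query
# returns a single value.
clusterpolicy_ready() {
  status="$(oc get clusterpolicy \
    -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}
```

It could be used as `clusterpolicy_ready && echo "ClusterPolicy is ready"`; with more than one ClusterPolicy the query returns several values and the comparison would need adjusting.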

Development

Successfully merging this pull request may close these issues.

vGPU Manager Container Base Image for OpenShift Virtualization 4.14+
